- A LOCKING SHIFT MECHANISM FOR THE KERMIT FILE TRANSFER PROTOCOL
-
- Christine M. Gianone
- Frank da Cruz
-
- Columbia University
- New York, NY USA
-
- DRAFT 4.2
-
- October 2, 1991
-
- ABSTRACT
-
- 7-bit communication channels remain quite common: they are in use on IBM
- mainframes, public data networks, in virtual terminal protocols like
- TCP/IP TELNET, and on any connection in which a device uses parity.
-
- The Kermit file transfer protocol achieves transparency over hostile
- communication environments by encoding all data as printable characters. In
- the 7-bit communications environment, 8-bit data is encoded in 7-bit form
- using the single shift; the "&" character acts as a prefix, meaning that the
- following character should have its 8th bit set to 1 after decoding. Kermit's
- single-shift 8th-bit quoting mechanism can add excessive transmission overhead
- to certain kinds of files, particularly text encoded in character sets like
- ISO 8859 Cyrillic, Greek, Hebrew, or Arabic, or 8-bit Japanese Kanji codes
- like EUC in which most data bytes have their 8th bit set to 1, resulting in
- 8th-bit quoting overhead up to 100%.
-
- A new locking shift mechanism is proposed to allow 8-bit data to be
- transferred more efficiently. This mechanism is an adaptation of the
- familiar Shift-In / Shift-Out scheme combined with Kermit's present
- single-shift technique, with some quoting rules added.
-
- This proposal was prompted not only by the longstanding need for increased
- efficiency in this area, but by a conference between the authors and Dr.
- Hirofumi Fujii of the Japan National Laboratory for High Energy Physics
- regarding the establishment of an official Kermit transfer syntax for
- Japanese, the subject of a separate proposal, and subsequent meetings in
- Japan. The algorithm and user interface were designed by Gianone and
- the detailed protocol design was contributed by da Cruz in the course of
- programming a trial implementation.
-
- The reader is assumed to be familiar with the Kermit file transfer protocol
- and with commonly used computer character sets.
-
- TERMINOLOGY
-
- In this proposal, the term "character" refers to an 8-bit byte, or octet,
- even if the data is encoded in a multibyte character set, or if it is not
- encoded in any character set at all (such as a binary file).
-
- An "8-bit character" is a data byte with its 8th bit set to 1. A "7-bit
- character" is one whose 8th bit is set to 0.
-
- A "control character" is a byte in the range 0-31 or 127 decimal (the "C0"
- set) or 128-159 or 255 (the "C1" set).
-
- A "printable character" is any character that is not a control character.
-
- NOTATION
-
- Numbers are written in decimal.
-
- "<XXX>" stands for an ASCII control character. "XXX" is replaced by the
- character's name, for example "<SOH>" for Start of Heading (Control-A).
-
- "<1>X" stands for an 8-bit character. The "X" can be a literal printable
- character (for example, "<1>A" is the ASCII letter A with its 8th bit set to
- 1) or a control character (for example "<1><SOH>" is a Control-A with its 8th
- bit set to 1).
-
- Similarly, "<0>X" stands for a 7-bit character.
-
- BACKGROUND
-
- The Kermit protocol presently specifies three separate prefix characters to
- be used within Kermit packets for transparency, compression, and quoting:
-
- The Control Prefix
- For transparency on serial communication links that are sensitive to
- control characters, the file sender precedes each C0 and C1 control with
- the control prefix, normally "#" (ASCII 35), and then encodes the control
- character itself by "exclusive-ORing" it with 64 decimal (i.e. inverting
- bit 6) to produce a character in the printable ASCII range. For example,
- Control-C (ASCII 3) becomes "#C" (3 XOR 64 = 67, which is the ASCII code
- for the letter C). Similarly, NUL becomes "#@", Control-A becomes "#A",
- Control-Z becomes "#Z", Escape becomes "#[", and DEL becomes "#?". The
- receiver decodes by discarding the prefix and XORing the character with
- 64 again. For example, in "#C", C = ASCII 67, and 67 XOR 64 = 3 =
- Control-C. Control prefixing is mandatory. The control prefix is also
- used for quoting prefix characters that occur in the data itself; see
- "The Prefix Quote" below.
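The control-prefix transformation described above can be sketched in a few lines of Python. This is only an illustration, not code from any Kermit implementation: the names ctl, encode_control, and decode_control are hypothetical, the default "#" prefix is assumed, and for brevity only the C0 range is handled.

```python
CTL_PREFIX = "#"  # the normal control prefix (ASCII 35)


def ctl(code: int) -> int:
    """Invert bit 6 (XOR 64); the same operation encodes and decodes."""
    return code ^ 64


def encode_control(code: int) -> str:
    """Encode a C0 control character (0-31 or 127) as prefix + printable."""
    if not (0 <= code < 32 or code == 127):
        raise ValueError("not a C0 control character")
    return CTL_PREFIX + chr(ctl(code))


def decode_control(pair: str) -> int:
    """Decode a two-character control-prefixed sequence such as '#C'."""
    if len(pair) != 2 or pair[0] != CTL_PREFIX:
        raise ValueError("not a control-prefixed sequence")
    return ctl(ord(pair[1]))


print(encode_control(3))     # #C  (Control-C)
print(encode_control(127))   # #?  (DEL)
print(decode_control("#C"))  # 3
```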
-
- The 8th-bit Prefix
- When one or both of the two Kermit programs knows that the connection
- between them is not transparent to the 8th bit (e.g. because the Kermit
- PARITY variable is not NONE, or because the program always operates that
- way), a feature called "8th-bit prefixing" is used if the two Kermit
- programs negotiate an agreement to do so. The 8th-bit prefix is Kermit's
- single shift, normally the ampersand character "&" (ASCII 38). When the
- file sender encounters an 8-bit character, it inserts the "&" prefix in
- front of it, and then inserts the data character itself with its 8th bit
- set to 0. If the data character is a control character, it is inserted
- after the 8th-bit prefix in control-prefixed form. Examples: an "A" with
- its 8th bit set to 1 ("<1>A") becomes "&A"; a Control-A with its 8th bit
- set to 1 ("<1><SOH>") becomes "&#A".
-
- The Repeat-Count Prefix
- The repeat-count prefix provides a simple form of data compression. It
- is used only when both Kermit programs support this feature and agree to
- use it. This prefix, normally tilde "~" (ASCII 126), precedes a repeat
- count, which can range from 0 to 94. The repeat count is encoded as a
- printable ASCII character in the range SP (32) - tilde (126) by adding
- 32. For example, a series of 36 G's would be encoded as "~DG" (D = ASCII
- 68 - 32 = 36). The repeat-count prefix applies to the following prefixed
- sequence, which may be a single character ("~DG"), an 8th-bit prefixed
- character ("~D&G" = 36 G's with their 8th bits set to 1), a
- control-prefixed character ("~D#M" = 36 Control-M's), or an
- 8th-bit-and-control-prefixed character ("~~&#Z" = 94 Control-Z's with
- their 8th bits set to 1).
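As a minimal sketch of the repeat-count arithmetic, the following Python fragment encodes a count by adding 32 and places it after the prefix. The helper names (tochar, unchar, repeat) are chosen for illustration, and the default "~" prefix is assumed.

```python
REPEAT_PREFIX = "~"  # the normal repeat-count prefix (ASCII 126)


def tochar(n: int) -> str:
    """Encode a number in the range 0-94 as a printable character."""
    if not 0 <= n <= 94:
        raise ValueError("value out of range 0-94")
    return chr(n + 32)


def unchar(ch: str) -> int:
    """Recover the number from its printable form."""
    return ord(ch) - 32


def repeat(sequence: str, count: int) -> str:
    """Put a repeat count in front of an already-prefixed sequence."""
    return REPEAT_PREFIX + tochar(count) + sequence


print(repeat("G", 36))    # ~DG
print(repeat("#M", 36))   # ~D#M
print(unchar("~"))        # 94, the largest count
```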
-
- The Prefix Quote
- The control prefix, normally "#", is also used to quote the control prefix
- itself if it occurs in the data: "##", means that the "#" character should
- be taken literally. If 8th-bit prefixing is in effect, the control prefix
- also quotes the 8th-bit prefix: "#&", so "#&D" stands for "&D" rather than
- "<1>D". If repeat count prefixing is in effect, the control prefix is also
- used to quote the repeat count prefix: "#~", so "#~CG" stands for "~CG"
- rather than 35 "G" characters. So the complete meaning of the "#" prefix
- is: if the value of the following character is 63-95 or 191-223, the
- prefixed character is to be XORed with 64, otherwise it is to be taken
- literally. The prefix quote can also be used harmlessly to quote 8th-bit
- or repeat-count prefixing characters even when these types of prefixing are
- not in effect.
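The complete meaning of the "#" prefix stated above amounts to a single range test on the character that follows it. The sketch below (the function name is hypothetical) takes the numeric value of that character and returns the decoded value.

```python
def decode_after_prefix(code: int) -> int:
    """Apply the '#' rule to the value of the character following the prefix."""
    if 63 <= code <= 95 or 191 <= code <= 223:
        return code ^ 64      # a control character in printable disguise
    return code               # a quoted prefix character, taken literally


print(decode_after_prefix(ord("C")))  # 3   (Control-C)
print(decode_after_prefix(ord("#")))  # 35  (a literal "#")
print(decode_after_prefix(ord("~")))  # 126 (a literal "~")
```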
-
- On a 7-bit connection the file sender, after encoding the data, adds the
- appropriate parity bit to all characters -- prefixes as well as data -- before
- transmission, and the file receiver strips the parity bit from all received
- characters before processing them.
-
- On an 8-bit-clean connection, 8th-bit prefixing need not be (and normally is
- not) done, and data characters retain their original 8th bit. For example,
- "A" with its 8th bit set to 1 is transmitted literally, without any
- prefixing ("<1>A"). Control-A with its 8th bit set to 1 is transmitted as
- "#" followed by the letter A with its 8th bit set to 1 ("#<1>A") because
- control prefixing is always in effect.
-
- SINGLE AND LOCKING SHIFTS
-
- The shift key on a typewriter lets the regular keys do "double duty". A
- given key produces different results depending on whether the shift key is
- up or down. Kermit's single shift (8th-bit prefix) is like the shift key:
- just as you must press two keys on the typewriter for every uppercase
- letter, Kermit must send two 7-bit characters for every 8-bit character when
- 8th-bit prefixing is in effect.
-
- Certain types of files have many 8-bit characters in a row. When this is
- the case, the overhead of single shifting could be as high as 100%.
- Efficiency could be much improved by the use of "locking shifts": the file
- sender tells the file receiver "Here comes a sequence of 8-bit characters"
- and then sends these characters in 7-bit form, relying on the receiver to
- put their 8th bits back before storing them.
-
- The locking shift behaves like the shift-lock key on a typewriter: to type a
- series of uppercase letters, you press the shift lock key once and then type
- the letters, one key per letter, rather than two. To go back to lowercase
- letters, release the shift lock key and then type more letters.
-
- When the data communications "shift-lock" key is active, 7-bit characters
- are said to be "shifted": they are not what they appear to be, but instead
- represent 8-bit characters. When the locking shift is not in effect, 7-bit
- characters stand for themselves; they are "unshifted".
-
- The locking shift characters are SO (Shift Out, Control-N, ASCII 14), and SI
- (Shift In, Control-O, ASCII 15). SO is sent at the beginning of a shifted
- sequence, SI is sent to return to normal unshifted operation. For example,
- on a 7-bit connection, the following string of characters (written using our
- notation):
-
- <0>A<0>B<0>C<1>D<1>E<1>F<1>G<1>H<1>I<0>J<0>K<0>L<0>M (13 characters)
-
- would be transmitted like this with single shifts:
-
- ABC&D&E&F&G&H&IJKLM (19 characters)
-
- and like this with locking shifts:
-
- ABC<SO>DEFGHI<SI>JKLM (15 characters)
-
- On an 8-bit connection, of course, the string of 13 characters can be
- transmitted as-is, with no overhead at all.
-
- Now suppose we have the following character sequence:
-
- <1>A<1>B<1>C<0>D<1>E<1>F<1>G<0>H<1>I<1>J<1>K<0>L<1>M (13 characters)
-
- Here several isolated 7-bit characters are found in the middle of a long run
- of 8-bit characters. Using locking shifts alone, this would be encoded as:
-
- <SO>ABC<SI>D<SO>EFG<SI>H<SO>IJK<SI>L<SO>M (20 characters)
-
- But using a combination of locking and single shifts, it can be encoded more
- compactly, as in this example, in which "&" is the single-shift character:
-
- <SO>ABC&DEFG&HIJK&LM (17 characters)
-
- This proposal adds the locking Shift-In/Shift-Out mechanism to the Kermit
- file transfer in a way that it can be used in conjunction with single shifts
- for maximum efficiency.
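The character counts in the examples above can be reproduced with a short Python sketch. It works directly on the "<0>X" / "<1>X" notation, counts each shift as one character (as the examples do), and ignores control prefixing; the function names are illustrative only.

```python
import re


def parse(notation):
    """Turn '<0>A<1>B' notation into (character, is_8bit) pairs."""
    return [(ch, bit == "1") for bit, ch in re.findall(r"<([01])>(.)", notation)]


def single_shifts(data):
    """Precede every 8-bit character with the single shift '&'."""
    out = []
    for ch, high in data:
        if high:
            out.append("&")
        out.append(ch)
    return out


def locking_shifts(data):
    """Insert a shift only where the 8th bit changes."""
    out, shifted = [], False
    for ch, high in data:
        if high != shifted:
            out.append("<SO>" if high else "<SI>")
            shifted = high
        out.append(ch)
    return out


first = parse("<0>A<0>B<0>C<1>D<1>E<1>F<1>G<1>H<1>I<0>J<0>K<0>L<0>M")
second = parse("<1>A<1>B<1>C<0>D<1>E<1>F<1>G<0>H<1>I<1>J<1>K<0>L<1>M")

print(len(single_shifts(first)))    # 19
print(len(locking_shifts(first)))   # 15
print(len(locking_shifts(second)))  # 20
```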
-
- NEGOTIATION
-
- Locking shifts are, like all new additions, an optional feature of the
- Kermit protocol. To allow old Kermit programs to interoperate transparently
- with the new ones that implement locking shifts, the use of this feature
- must be negotiated and agreed upon by both Kermit programs before it can be
- used.
-
- Two Kermit programs agree to use the locking shift extension via a new
- capability bit, together with the existing 8th-bit prefixing (QBIN) field.
- The capabilities mask is the 10th character in the initialization string. It
- contains a bit mask encoded as a printable character by adding 32 (ASCII
- Space).
-
- Capability number 1 (bit 5, which until now has been reserved for future
- use) will be used to indicate the locking shift capability: 1 if enabled, 0
- if not. Thus old Kermits automatically disable the use of locking shifts
- because they never set this bit. The format of Kermit's capability mask is:
-
-   bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0
-  +----+----+----+----+----+----+----+----+
-  | X  | X  | 1  | 2  | 3  | 4  | 5  | Z  |
-  +----+----+----+----+----+----+----+----+
-
- where:
-
- X = Must not be used
- 1 = Locking Shift Capability
- 2 = Extra-Long Packet Capability (9025-857374)
- 3 = Attribute Packet Capability
- 4 = Sliding Window Capability
- 5 = Long Packet Capability (95-9024)
- Z = Capability Mask Extension Bit (allows addition of new mask bytes)
-
- The locking shift protocol is used if and only if:
-
- 1. The file sender sets the Locking Shift Capability bit in the S (Send
- Initialization) packet;
-
- 2. The file receiver also sets the same bit in its acknowledgement to the S
- packet; and
-
- 3. The parties have agreed to use single shifts via the QBIN field.
-
- Thus, locking shifts REQUIRE 8th-bit prefixing. This is reasonable because
- (a) 8th-bit prefixing is easy to program; (b) all the popular Kermit programs
- already implement it; (c) little is gained by using locking shifts without
- single shifts; (d) it simplifies the user interface and the negotiation
- process; and (e) it allows the file receiver as well as the sender to request
- locking shifts.
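A minimal sketch of the negotiation, assuming the capability mask is handled as a single printable character (mask value plus 32) and that the outcome of the QBIN negotiation is already known. The constant and function names are hypothetical.

```python
# Bit values taken from the capability mask layout above.
LOCKING_SHIFT = 1 << 5
EXTRA_LONG_PACKETS = 1 << 4
ATTRIBUTE_PACKETS = 1 << 3
SLIDING_WINDOWS = 1 << 2
LONG_PACKETS = 1 << 1
MASK_EXTENSION = 1 << 0


def encode_capas(mask: int) -> str:
    """Encode a 6-bit capability mask as a printable character."""
    if not 0 <= mask <= 0x3F:
        raise ValueError("bits 6 and 7 must not be used")
    return chr(mask + 32)


def decode_capas(ch: str) -> int:
    return ord(ch) - 32


def locking_shifts_agreed(sender_capas: str, receiver_capas: str,
                          single_shifts_agreed: bool) -> bool:
    """All three conditions must hold: both bits set and QBIN agreed."""
    both = decode_capas(sender_capas) & decode_capas(receiver_capas)
    return bool(both & LOCKING_SHIFT) and single_shifts_agreed


sender = encode_capas(LOCKING_SHIFT | ATTRIBUTE_PACKETS)
print(sender)                                     # H
print(locking_shifts_agreed(sender, "@", True))   # True
print(locking_shifts_agreed(sender, "(", True))   # False: receiver bit not set
```

An old Kermit never sets bit 5, so the test fails for it without any special handling.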
-
- ENCODING RULES
-
- Kermit's locking shift protocol uses the C0 control character Shift Out (SO,
- Control-N, ASCII 14) to precede a sequence of 8-bit characters, and Shift In
- (SI, Control-O, ASCII 15) to precede a sequence of 7-bit characters.
-
- Whether or not locking shift protocol is in effect, all of Kermit's normal or
- negotiated prefixing rules also remain in effect, so SO appears in the packet
- as "#N" and SI appears as "#O".
-
- Each Kermit program maintains a SHIFT-STATE, which may be SHIFTED (shifted
- out) or UNSHIFTED (shifted in). SHIFTED means that 8-bit characters are
- being transmitted in 7-bit form (preceded by a Shift-Out character) and
- UNSHIFTED means that 7-bit characters represent themselves. For each file,
- the initial SHIFT-STATE is defined to be UNSHIFTED, so there is no need for
- the sender to transmit an initial Shift-In (but it does no harm).
-
- A. When the file sender's SHIFT-STATE is UNSHIFTED and it reads a 7-bit
- character, it adds the character to the packet according to Kermit's
- other prefixing rules (control and repeat count), and adds the
- appropriate parity bits. Thus, any number of 7-bit characters can be
- transmitted in a row.
-
- B. When the file sender's SHIFT-STATE is UNSHIFTED and it reads an 8-bit
- data character, there are two possibilities:
-
- 1. If single-shifting (8th-bit prefixing) is in effect, insert a
- single-shift character ("&") with the appropriate parity bit before
- the 8-bit data character, and add the data character itself with its
- 8th bit replaced by the appropriate parity bit.
-
- OR:
-
- 2. Insert a Shift Out (SO) character into the packet (encoded as "#N"
- with the appropriate parity bits), change the SHIFT-STATE to SHIFTED,
- and then add the data character with its 8th bit replaced by the
- appropriate parity bit.
-
- C. When the file sender's SHIFT-STATE is SHIFTED and it reads an 8-bit
- character, it adds the character to the packet according to Kermit's
- other prefixing rules (control and repeat count), replacing the
- character's 8th bit by the appropriate parity bit. Thus, any number of
- 8-bit characters may be transmitted in a row in 7-bit form after the SO.
-
- D. When the file sender's SHIFT-STATE is SHIFTED and a 7-bit character is
- encountered, there are two possibilities:
-
- 1. If single-shifting is in effect, insert a single-shift character ("&")
- before the 7-bit character and add the appropriate parity bits.
-
- OR:
-
- 2. Insert a Shift-In (SI) character (encoded as "#O" with the appropriate
- parity bits) into the packet and change the SHIFT-STATE to UNSHIFTED,
- and then insert the data character itself with the appropriate parity
- bit.
-
- E. If a repeated sequence of characters occurs where the shift state changes,
- the locking shift is encoded BEFORE the repeat-count sequence: #O~xA,
- not ~x#OA.
-
- F. If the file ends in SHIFTED state, there is no need to issue a Shift-In
- code at the end of the file, but it does no harm either.
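Rules A through D can be sketched as a small state machine. The Python below is an illustration under several assumptions, not the trial implementation: parity bits and repeat counts (rule E) are left out, data bytes whose low 7 bits are SO, SI, or DLE are rejected because their quoting is defined later in this proposal, and the choice between options 1 and 2 is made by a simple hypothetical policy (use a locking shift when at least `threshold` upcoming characters need it). Any such policy yields a valid encoding, though not always the shortest one.

```python
PREFIXES = "#&~"        # control, 8th-bit, and repeat-count prefixes
SO, SI, DLE = 14, 15, 16


def quote7(low: int) -> str:
    """Encode one 7-bit value with control prefixing and prefix quoting."""
    if low in (SO, SI, DLE):
        raise ValueError("SO, SI, and DLE need DLE quoting, not shown here")
    if low < 32 or low == 127:
        return "#" + chr(low ^ 64)
    if chr(low) in PREFIXES:
        return "#" + chr(low)
    return chr(low)


def encode(data: bytes, threshold: int = 4) -> str:
    out = []
    shifted = False                       # each file starts UNSHIFTED
    for i, byte in enumerate(data):
        high = byte >= 128
        seq = quote7(byte & 0x7F)
        if high == shifted:               # rules A and C
            out.append(seq)
            continue
        # Count upcoming characters (at most `threshold`) with this 8th bit.
        run = 0
        for nxt in data[i:i + threshold]:
            if (nxt >= 128) != high:
                break
            run += 1
        if run >= threshold:              # rules B.2 and D.2: locking shift
            out.append("#N" if high else "#O")
            shifted = high
            out.append(seq)
        else:                             # rules B.1 and D.1: single shift
            out.append("&" + seq)
    return "".join(out)


data = b"ABC" + bytes(c | 0x80 for c in b"DEFGHI") + b"JKLM"
print(encode(data))                       # ABC#NDEFGHI#OJKLM
```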
-
- COMBINING SINGLE AND LOCKING SHIFTS
-
- When locking shifts and single shifts are in effect, the meaning of the
- single-shift character is reversed when the SHIFT-STATE is SHIFTED. Single
- shifts can be used to efficiently encode isolated characters that don't fit
- the current SHIFT-STATE. For example:
-
- Data Encoding
-
- 1. ABCABC<1>EBCABC ABCABC&EBCABC
- 2. <1>A<1>B<1>C<1>A<1>BXY<1>B<1>C<1>A #NABCAB&X&YBCA
-
- In (1) the single shift "&" sets the 8th bit of "E" to 1 (normal Kermit
- practice), but in (2) the single shift sets the 8th bit of "X" and "Y" to 0
- because the SHIFT-STATE is SHIFTED (#N).
-
- The file sender can decide whether to use single or locking shifts by
- looking ahead in the input file data. Single shifts are more efficient when
- there are one, two, or three n-bit characters in a row; locking shifts are
- more efficient when there are five or more n-bit characters in a row (n is
- either 7 or 8):
-
-   Single Shift        Locking Shift
-   &A          (2)     #OA#N      (5)  (worse)
-   &A&B        (4)     #OAB#N     (6)  (worse)
-   &A&B&C      (6)     #OABC#N    (7)  (worse)
-   &A&B&C&D    (8)     #OABCD#N   (8)  (same)
-   &A&B&C&D&E  (10)    #OABCDE#N  (9)  (better)
-
- Thus five-character lookahead is sufficient to make the best decision.
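The break-even point follows from simple arithmetic: a single-shifted run of n characters costs 2n packet characters, while a locking-shifted run costs n plus two characters for each shift. A minimal sketch, with hypothetical function names:

```python
def single_shift_cost(n: int) -> int:
    """One '&' plus the character itself, for each of n characters."""
    return 2 * n


def locking_shift_cost(n: int, shift_back: bool = True) -> int:
    """Two characters per shift ('#N' or '#O') plus the n characters."""
    return n + (4 if shift_back else 2)


for n in range(1, 6):
    print(n, single_shift_cost(n), locking_shift_cost(n))
# 1 2 5
# 2 4 6
# 3 6 7
# 4 8 8
# 5 10 9
```

When no shift back is needed, for instance at the end of a file, only one shift is paid for (shift_back=False) and the locking shift wins from three characters onward.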
-
- REPEAT COUNTS AND LOCKING SHIFTS
-
- A repeated sequence of 8-bit characters that occurs while in UNSHIFTED state,
- for example abc<1>X<1>X<1>X<1>X, can be encoded by using a single shift:
-
- abc~$&X
-
- A repeated sequence of 8-bit characters that occurs while in SHIFTED
- state, for example:
-
- abc<1>A<1>B<1>C<1>X<1>X<1>X<1>X<1>X<1>X<1>X<1>X<1>D<1>E<1>F
-
- is encoded using the same repeat-count notation:
-
- abc#NABC~(XDEF
-
- Just as the # and & prefixes are used as prefixes in both UNSHIFTED and
- SHIFTED states, so is the repeat-count prefix, ~. The same sequence could
- also be encoded less efficiently as:
-
- abc#NABC#O~(&X#NDEF
-
- PREFIX CHARACTERS THAT OCCUR IN THE DATA
-
- Since Kermit prefix characters can occur within file data, they must be
- prefixed to distinguish them from true prefixes. The following encoding
- is used:
-                      STATE..............
-   CHARACTER     UNSHIFTED       SHIFTED
-   #             ##              &##
-   &             #&              &#&
-   ~             #~              &#~
-   <1>#          &##             ##
-   <1>&          &#&             #&
-   <1>~          &#~             #~
-
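The quoting of prefix characters in either state reduces to one rule: quote the 7-bit form with "#", and add a single shift when the character's 8th bit does not match the current SHIFT-STATE. A short sketch under that reading (the function name is hypothetical, and the default "#", "&", and "~" prefixes are assumed):

```python
PREFIXES = "#&~"


def quote_prefix_char(byte: int, shifted: bool) -> str:
    """Encode a data byte whose low 7 bits are a Kermit prefix character."""
    low = chr(byte & 0x7F)
    if low not in PREFIXES:
        raise ValueError("not a prefix character")
    quoted = "#" + low
    high = byte >= 128
    # A single shift is needed when the 8th bit differs from the state.
    return quoted if high == shifted else "&" + quoted


print(quote_prefix_char(ord("#"), shifted=False))         # ##
print(quote_prefix_char(ord("#"), shifted=True))          # &##
print(quote_prefix_char(ord("&") | 0x80, shifted=False))  # &#&
print(quote_prefix_char(ord("~") | 0x80, shifted=True))   # #~
```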
- QUOTING THE LOCKING SHIFT CHARACTERS
-
- Since Control-O and Control-N can appear within file data, there has to be
- a way to distinguish the use of these characters as locking shifts from
- their use as data characters.
-
- When (and only when) locking shift protocol is in effect, SO and SI
- characters that appear in the data must be prefixed by Data Link Escape
- (DLE, Control-P, ASCII 16), normally encoded as "#P". If DLE itself appears
- in the file, it too must be prefixed by DLE.
-
- The DLE character applies to the ENTIRE PREFIXED SEQUENCE that follows it.
- This may be a single character, a control-prefixed character, an 8th-bit
- prefixed character, or a repeat-count-prefixed sequence of any combination
- of these. To illustrate the difference between quoting by "#" and DLE,
- "##O" indicates a literal "#" character followed by the letter "O", whereas
- "<DLE>#O" indicates a literal Control-O. In practice, the file sender
- should use DLE only to prefix SO, SI, and itself, but the receiver should
- treat DLE as a general "prefixed sequence" quote: it should discard the DLE,
- decode the following prefixed sequence, and treat the result as data rather
- than Kermit protocol information.
-
- Should a repeated sequence of SO's, SI's, or DLE's occur within the data,
- the entire sequence may be encoded with a repeat count and prefixed by a
- single DLE, which applies to all copies of the repeated character. For
- example, "#P~A#N" indicates 33 SO characters in a row that are not to be
- treated as locking shifts.
-
- When locking shift protocol is in effect, we must handle the C1 counterparts
- of SO, SI, and DLE (that is, using our notation, <1>SO, <1>SI, and <1>DLE).
- These characters would be inserted into the packet in their 7-bit form when
- the SHIFT-STATE is SHIFTED, and the receiver would have no way of
- distinguishing a data #O from a Shift-In #O, or a data #N from a Shift-Out #N,
- or a data #P from a quoting #P. Therefore these characters too should be
- prefixed by DLE when in SHIFTED state.
-
- If a 7-bit SO, SI, or DLE appears in the data during SHIFTED state, the file
- sender can "single-shift" it in the normal manner, for example "&#O". The
- file receiver must treat such sequences as literal data characters, as if
- they had been prefixed by DLE, not as shifts and quotes.
-
- The rule, therefore, is that if #O, #N, and #P have no prefix of any kind,
- then they are used for shifting and quoting. When these characters are
- prefixed by either "&" or DLE, no matter what the SHIFT-STATE is, they are
- data characters:
-
-   File               SHIFT-STATE
-   Character     UNSHIFTED        SHIFTED
-   SI            #P#O             &#O or #P&#O
-   <1>SI         &#O or #P&#O     #P#O
-   SO            #P#N             &#N or #P&#N
-   <1>SO         &#N or #P&#N     #P#N
-   DLE           #P#P             &#P or #P&#P
-   <1>DLE        &#P or #P&#P     #P#P
-
- The "&#O" form need not be prefixed by "#P", but no harm is done if it is.
- The packet receiver must respond to these prefixed sequences as follows:
-
-   Packet             SHIFT-STATE
-   Sequence      UNSHIFTED        SHIFTED
-   #O            Discard*         Shift In
-   #P#O          Literal SI       Literal <1>SI
-   &#O or #P&#O  Literal <1>SI    Literal SI
-   #N            Shift Out        Discard*
-   #P#N          Literal SO       Literal <1>SO
-   &#N or #P&#N  Literal <1>SO    Literal SO
-   #P            Quote            Quote
-   #P#P          Literal DLE      Literal <1>DLE
-   &#P or #P&#P  Literal <1>DLE   Literal DLE
-
- The "Discard*" entries are for when a redundant shift is received, for
- example an unprefixed Shift-Out when the Kermit receiver is already shifted
- out. Redundant shifts do not affect the current SHIFT-STATE and are not
- interpreted as data; they are simply ignored and discarded by the receiver.
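The receiver's handling of shifts, quotes, and prefixes can be sketched as a single decoding loop. The Python below is a hedged illustration rather than a reference decoder: it assumes a well-formed data field with parity already stripped, the default prefix characters, a 7-bit connection, and it treats #N and #O as shifts only when they carry no DLE, "&", or repeat-count prefix. The function name is hypothetical.

```python
from typing import Tuple

SO, SI = 14, 15


def decode(packet: str, shifted: bool = False) -> Tuple[bytes, bool]:
    """Decode one packet's data field; return (data, new SHIFT-STATE)."""
    out = bytearray()
    i = 0
    while i < len(packet):
        quoted = packet.startswith("#P", i)   # DLE: what follows is data
        if quoted:
            i += 2
        count = 1
        repeated = packet[i] == "~"           # repeat-count prefix
        if repeated:
            count = ord(packet[i + 1]) - 32
            i += 2
        single = packet[i] == "&"             # single shift
        if single:
            i += 1
        code = ord(packet[i])
        i += 1
        control = False
        if code == ord("#"):                  # control prefix or prefix quote
            code = ord(packet[i])
            i += 1
            if 63 <= code <= 95:
                code ^= 64
                control = True
        bare = control and not (quoted or repeated or single)
        if bare and code == SO:
            shifted = True                    # a redundant shift changes nothing
        elif bare and code == SI:
            shifted = False
        else:
            high = shifted != single          # "&" inverts the current state
            out.extend(bytes([code | 0x80 if high else code]) * count)
    return bytes(out), shifted


data, state = decode("#P~A#N")
print(len(data), data[0], state)              # 33 14 False
```

Because the SHIFT-STATE is returned, the caller can pass it to the next packet of the same file and reset it to UNSHIFTED at the start of each new file.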
-
- BOUNDARY CONDITIONS
-
- Although sequences of characters prefixed by "#", "&", or "~" may not be
- broken across packet boundaries, locking shifts are effective across packet
- boundaries. However, locking shifts are not effective across file
- boundaries; when a group of files is being transferred, the SHIFT-STATE must
- be set to UNSHIFTED at the beginning of each file.
-
- THE FILE RECEIVER
-
- The file receiver has no decisions to make; it is totally driven by the
- sequence of characters in each packet it receives. The receiver operates as
- it does without the locking shift protocol, but with additional rules: it must
- recognize the locking shift indicators "#N" and "#O", set the SHIFT-STATE to
- SHIFTED when it sees "#N" and to UNSHIFTED when it sees "#O", and set the value
- of the 8th bit of each data character according to the current SHIFT-STATE.
- It must treat #, &, and ~ as prefix characters even when the SHIFT-STATE is
- SHIFTED, remembering that the meaning of the single-shift prefix "&" is
- inverted. (The file receiver can also store the shift characters as is -- see
- the COMMANDS section below.)
-
- COMMANDS
-
- One new command is required:
-
- SET TRANSFER LOCKING-SHIFT { ON, OFF, FORCED }
-
- The options are as follows:
-
- ON: Enables the use of locking shifts. The Kermit program sets the locking
- shift capability bit in any S or I packets it sends, or in any
- acknowledgement to an S or I packet. Locking shifts are actually used if
- and only if both Kermits set this bit AND single-shifts are successfully
- negotiated. If a Kermit program implements the locking shift protocol,
- the default TRANSFER LOCKING-SHIFT setting should be ON.
-
- OFF: Disables the use of locking shifts. The Kermit program sets the
- locking shift capability bit to zero in all negotiation packets, and
- treats SO, SI, and DLE as ordinary data characters in Kermit data
- packets.
-
- FORCED: Forces the use of locking shifts, regardless of the PARITY setting and
- capability negotiation. The file sender sets the locking shift bit in the
- capability mask, sets the QBIN (8th-bit prefix) field to "N", and ignores
- the receiver's reply. The file receiver sets the same values, regardless
- of the sender's values. A Kermit program that has been given this command
- acts as if locking shift protocol had been successfully negotiated and
- single shifts have been disabled.
-
- With these facilities and defaults in effect, the Kermit user will get
- locking shift protocol automatically whenever PARITY is not NONE and both
- Kermits support locking shifts (which implies they also support single shifts
- and that single shifts were negotiated successfully).
-
- SET TRANSFER LOCKING-SHIFT FORCED can be used to force the file sender to
- use locking shifts even if the receiver doesn't understand this protocol, or
- to force the file receiver to treat SO/SI/DLE codes in arriving files as
- prescribed by this proposal. This allows an 8-bit data file to be sent
- through a 7-bit connection to a Kermit program that does not implement
- 8th-bit prefixing or locking shifts. The result can be displayed on terminals
- or printers that respond appropriately to Shift-In/Shift-Out codes, sent
- through e-mail, or postprocessed with a simple SO/SI filter to reconstruct
- it, provided the original file does not contain SO, SI, or DLE characters.
- If a file containing SO/SI codes is sent to a Kermit program with SET
- TRANSFER LOCKING-SHIFT FORCED in effect, the data is reconstructed according
- to the imbedded shifts.
-
- The SET TRANSFER LOCKING-SHIFT FORCED option is, of course, risky, and can
- result in undesired effects if used improperly. For example, if the file
- contains SO or SI characters as data, the shift state can become inverted.
- Furthermore, DLE does not serve to "quote" SO or SI characters in ordinary
- data communication; SO and SI usually act as locking shifts even when
- preceded by DLE (or any other character). For example, when the sequence
- "<SI>ABC<DLE><SO>DEF" is sent to a VT300 terminal, the DLE is ignored and
- the characters DEF are shifted.
-
- Here are the possible SET TRANSFER LOCKING-SHIFT combinations and their
- effects. The OFF entries also apply to Kermit programs that don't implement
- locking shift protocol at all:
-
-   Sender   Receiver   Effect
-   ON       ON         Locking shift protocol done if single shifts negotiated
-   ON       OFF        No locking shifts
-   ON       FORCED     SO/SI/DLE in data interpreted as shifts by receiver
-   OFF      ON         No locking shifts
-   OFF      OFF        No locking shifts
-   OFF      FORCED     SO/SI/DLE in data interpreted as shifts by receiver
-   FORCED   ON         Sender adds shifts, receiver stores them as data (*)
-   FORCED   OFF        Sender adds shifts, receiver stores them as data
-   FORCED   FORCED     Locking shift protocol is done with no single shifts
-
- (*) Sender announces that it WON'T do single shifts, which disables
- the receiver's locking-shift protocol.
-
- CHARACTER SET TRANSLATION
-
- SET TRANSFER LOCKING-SHIFT FORCED (or any other LOCKING-SHIFT setting)
- does not affect character set translation. Translation is still done if the
- user has elected to do it.
-
- Here are the possibilities when the sender has SET TRANSFER LOCKING-SHIFT
- FORCED and has announced an 8-bit transfer character set in the Attribute
- packet, and the receiver supports character-set translation but is not doing
- locking shift protocol:
-
- 1. Receiver translates the transfer character set into an 8-bit file
- character set whose first 128 characters are ASCII, such as an IBM code
- page, KOI-8, the Apple or NeXT character set, etc. In this case, the
- desired effect is achieved automatically.
-
- 2. Receiver translates the transfer character set into a 7-bit file
- character set such as an ISO 646 NRC or Short KOI. In this case the
- result is garbage. Locking shifts should not be used here. For the
- languages covered by ISO 646 NRCs, single shifts are more efficient.
-
- 3. The receiver does not understand the transfer character set. The
- situation here is no different with locking shifts than without them.
-
- PERFORMANCE
-
- A preliminary implementation of the shifting algorithms described in this
- proposal was coded and tested on a large number of text and binary files and
- worked correctly: the result of encoding and then decoding each file was
- identical to the original. All combinations of single shift, locking shift,
- and repeat-count compression were tested successfully in both text and
- binary file mode.
-
- The following table shows the number of characters required to encode files of
- different representative types (taken from a much larger sample) using
- different combinations of single shifts (SS) and locking shifts (LS), but
- without repeat-count compression (R). For comparison, the final column
- includes repeat-count compression. The number in parentheses is the
- "expansion factor" showing how much the data grew in the encoding process.
- The .TXT files were encoded in text mode, the others were encoded in binary
- mode.
-
- File                    Encoding..................................................
- Name          Length    SS...........  LS...........  LS+SS........  LS+SS+R......
-
- ASCII.TXT     190689    202173 (1.06)  202126 (1.06)  202173 (1.06)  194938 (1.02)
- GERMAN.TXT     39611     42159 (1.06)   43336 (1.09)   42169 (1.06)   41558 (1.05)
- FRENCH.TXT    108021    116426 (1.08)  124446 (1.15)  116446 (1.08)  115531 (1.07)
-
- CYRILL1.TXT    52046     95700 (1.84)   80998 (1.56)   64602 (1.24)   64476 (1.24)
- CYRILL2.TXT    13699     25293 (1.85)   23429 (1.71)   18306 (1.34)   18078 (1.32)
- CYRILL3.TXT    28434     49834 (1.75)   43029 (1.51)   37104 (1.30)   35519 (1.25)
- CYRILL4.TXT    51011     89419 (1.75)   78217 (1.53)   63157 (1.24)   63010 (1.24)
- Cyrillic
-   Totals      145190    260246 (1.79)  225673 (1.55)  183169 (1.26)  181083 (1.25)
-
- KANJI.TXT      29706     59494 (2.00)   32527 (1.09)   32629 (1.10)   32648 (1.10)
- KANJIA.TXT    106943    157536 (1.47)  122043 (1.14)  121822 (1.14)  118563 (1.11)
- Kanji
-   Totals      136649    217030 (1.59)  154570 (1.13)  154451 (1.13)  151211 (1.11)
-
- MSVIBM.EXE    146989    247766 (1.69)  302348 (2.06)  248991 (1.69)  210598 (1.43)
- WERMIT        419861    737812 (1.76)  923451 (2.20)  760912 (1.81)  713830 (1.70)
- FILE.ZIP       96911    173145 (1.79)  226407 (2.34)  172627 (1.78)  172841 (1.78)
-
- ASCII.TXT is a plain US ASCII text file containing English prose and no
- 8-bit characters. GERMAN.TXT and FRENCH.TXT are German- and French-language
- documents coded in ISO 8859-1 Latin Alphabet 1.
-
- CYRILL1.TXT is a chapter from a Russian computer book, containing only a few
- English words. CYRILL2.TXT is a poem, The Bronze Horseman by Pushkin; its
- lines are short and there are many blank lines so there is a higher CRLF-to-
- text ratio. CYRILL3.TXT is "Murphy's Laws" in Russian, in which lines tend to
- be short, blank, or indented. CYRILL4 is a RussTeX source file in which the
- TeX commands are ASCII and the text is Cyrillic. The Cyrillic text in all
- these files is ISO 8859-5 Latin/Cyrillic 8-bit text.
-
- KANJI.TXT is a Japanese-language text file encoded in the Japanese EUC code.
- KANJIA.TXT contains a mixture of ASCII English and Japanese Kanji encoded in
- EUC.
-
- MSVIBM.EXE is an IBM PC binary executable program image. WERMIT is a SUN-4
- (Sparc) binary executable program image. FILE.ZIP is a binary MS-DOS ZIP
- archive.
-
- ANALYSIS
-
- For binary files, locking (combined) shifts generally provide no benefit
- over single shifts. These files tend to have a high percentage of bytes in
- the C0 and C1 ranges, and therefore suffer high overhead from control
- prefixing. Furthermore, they rarely have long runs of 8-bit characters.
- The reason the combined shift is less efficient than the single shift is the
- necessity to quote SO, SI, and DLE characters that occur in the data.
-
- For text files encoded in "left-handed" 8-bit character sets such as ISO
- 8859 Latin Alphabets 1-4 and 9 (for languages based on Roman characters),
- 8-bit characters generally occur only in isolation, and so locking
- (combined) shifts provide no significant benefit over single shifts.
-
- Locking and combined shifts provide a substantial performance improvement
- over single shifts for text files written in "right-handed" 8-bit character
- sets like the Latin Arabic, Cyrillic, Greek, and Hebrew alphabets where long
- sequences of 8-bit bytes predominate, and for certain multibyte character
- sets such as Japanese EUC, in which all Kanji-character bytes have their 8th
- bits set to 1.
-
- CONCLUSION
-
- The locking shift algorithm is easy to program and is inexpensive in both
- execution time and code space. Implementation of locking shift protocol is
- recommended for Kermit programs that must transfer files likely to contain
- many sequences of 5 or more consecutive 8-bit GR bytes over 7-bit
- communication channels. Such files tend to be text files encoded in the ISO
- character sets for non-Roman alphabets and in EUC Kanji codes, but there might
- be other candidates too: binary image (raster) data, spreadsheet data, etc.
- For such files, the efficiency improvement can approach 100%.
-
- REFERENCES
-
- Gianone, Christine M., "A Kermit Protocol Extension for International
- Character Sets", Columbia University (1990).
-
- da Cruz, Frank, "Kermit, A File Transfer Protocol", Digital Press (1987).
-
- ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for
- Information Interchange".
-
- ISO 2022, "Information processing - ISO 7-bit and 8-bit coded character
- sets - Code extension techniques" (1985).
-
- ISO 8859, "Information processing - 8-bit single-byte coded graphic
- character sets", parts 1-9 (1987-present)
-
- "JIS X 0212 Study Group Interim Report"
-
- ACKNOWLEDGEMENTS
-
- Thanks to John Chandler, John Klensin, Paul Placeway, and Konstantin
- Vinogradov for their detailed comments on this proposal, and to Gisbert W.
- Selke for the German file, Andre' Pirard for the French, Konstantin
- Vinogradov and Dimitri Vulis for the Russian files, and Hirofumi Fujii for
- the Japanese files.
-
- (The End)
-